# Multimodal Agent

GUI Actor 7B Qwen2 VL
MIT
GUI-Actor-7B is a vision-language model built on Qwen2-VL-7B-Instruct. It targets graphical user interface (GUI) agent tasks and provides coordinate-free visual grounding.
Multimodal Fusion Transformers
microsoft
207
14
Qwen2.5 VL 7B Instruct GGUF
Apache-2.0
Qwen2.5-VL is the latest vision-language model from the Qwen family, featuring powerful visual understanding and multimodal processing capabilities, supporting image and video analysis with structured output.
Image-to-Text English
unsloth
8,427
4
Gemma 3 R1984 4B
Gemma3-R1984-4B is an agentic AI platform built on Google's Gemma-3-4B model, supporting multimodal file processing and deep research capabilities.
Image-to-Text Transformers Supports Multiple Languages
ginipick
44
4
VideoMind 7B
BSD-3-Clause
VideoMind is a multimodal agent framework that enhances video reasoning by simulating human thought processes.
Video-to-Text
yeliudev
90
2
Magma 8B
MIT
Magma is a foundation model for multimodal AI agents. It takes image and text inputs, generates text outputs, and supports complex interaction in both virtual and real-world environments.
Image-to-Text Transformers
microsoft
4,526
363
OmniParser V2.0
MIT
OmniParser is a general-purpose screen parsing tool that converts UI screenshots into structured formats to improve the performance of LLM-based UI agents.
Image-to-Text Transformers
microsoft
6,729
1,185
Qwen2.5 VL 3B Instruct 4bit
Qwen2.5-VL is the latest vision-language model in the Qwen family, featuring enhanced visual understanding, agent capabilities, and long video processing.
Image-to-Text Transformers English
jarvisvasu
174
3
Fuyu 8B
Fuyu-8B is a multimodal text-and-image transformer developed by Adept AI for digital agents. It supports arbitrary image resolutions, responds quickly, and has a streamlined architecture.
Image-to-Text Transformers
adept
14.22k
1,006
© 2025 AIbase